Quantification of the differences between quenched and annealed averaging for RNA 

secondary structures 
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The analytical study of disordered system is usually difficult due to the necessity to perform a 
quenched average over the disorder. Thus, one may resort to the easier annealed ensemble as an 
approximation to the quenched system. In the study of RNA secondary structures, we explicitly 
quantify the deviation of this approximation from the quenched ensemble by looking at the correla- 
tions between neighboring bases. This quantified deviation then allows us to propose a constrained 
annealed ensemble which predicts physical quantities much closer to the results of the quenched 
ensemble without becoming technically intractable. 

PACS numbers: 87.14.Gg, 87.15-v, 05.70.Fh 



I. INTRODUCTION 

Heteropolymer folding is of crucial significance in 
molecular biology. It is the basis for the mechanism 
with which cells can produce three dimensional build- 
ing blocks out of the one-dimensional information stored 
in their genome. Cells achieve this by forming (still one- 
dimensional) polymers (proteins and RNA) by stringing 
together different monomers with covalent bonds. All 
monomers share a compatible backbone but they have 
different side chains and occur in a predefined order 
along the sequence. Physical interactions between these 
monomers force the polymer to stably fold into a three 
dimensional structure. This structure is crucial for the 
function of the molecule; it is determined by the specific 
sequence of the polymer 0, 0, 0, ■ 

In addition to its biological relevance, heteropolymer 
folding is also a very interes ting pr oblem of statistical 

mechanics |ElllllllS|lS|ll|llM|lllIllIllI3- Th e 

competition between the configurational entropy of the 
polymer, the overall tendency of the monomers to stick to 
each other, the sequence disorder, and the preference of 
folding toward a biologically active native state, leads to a 
very rich thermodynamic phase diagram. While the same 
qualitative behavior is expected for proteins and RNA, 
we will here concentrate on RNA since RNA folding is 
more amenable to analytical and numerical approaches 
than protein folding. The relative simplicity of the RNA 
folding problem compared to the protein folding problem 
does not stem from the fact that RNA consists of only 
four different bases versus the twenty amino acids the 
proteins are composed of, but it comes from the simpler 
interaction rules: The dominant interaction between the 
four bases A, U, G, and C of an RNA molecule is Watson- 
Crick (G-C and A-U) pair formation, i.e., if two bases 
have formed a pair they to first order do not take part in 
any further interactions. Every amino acid of a protein 
on the contrary interacts with all its spatial neighbors, 
i.e., with on the order of ten other amino acids at a time. 

From a statistical physics point of view, the possibility 
of a glass phase at low temperatures driven by sequence 
disorder, is of special interest in the heteropolymer fold- 



ing problem 0, EH El 13 . Unfortunately, even 
for the case of RNA folding an analytic quantitative de- 
scription of the glass phase is still outstanding. Thus, 
quantitative studies have to either rely on numerics or 
they have to use what is known as the annealed average. 
In the annealed average, the free energy of the system is 
approximated as the logarithm of the ensemble averaged 
partition function (instead of taking the ensemble aver- 
age over the logarithm of the partition function called 
the quenched average). Physically, this approximation 
corresponds to treating the sequence degrees of freedom 
as dynamical instead of frozen variables. Thus, the an- 
nealed system represents a sequence ensemble that is cou- 
pled to the structural ensemble by way of the interaction 
energies. This sequence ensemble may be different from 
the original sequence ensemble of uncorrelated random 
sequences over which the free energy is supposed to be 
averaged. Due to these differences between the annealed 
and the quenched sequence ensemble the annealed free 
energy is only an approximation to the true (quenched) 
free energy of a disordered system. 

The purpose of this manuscript is to first quantify the 
differences between the annealed and the quenched se- 
quence ensembles. Specifically, we will look at correlation 
between neighboring bases. We show that while this cor- 
relation is strictly zero in the correct (quenched) sequence 
ensemble, they are non-zero in the annealed sequence en- 
semble and increase with decreasing temperature - up to 
complete correlation in certain models of RNA folding. 
This clearly underlines and quantifies the fundamental 
shortcomings of the annealed average in the RNA fold- 
ing problem at low temperatures. 

Based on the quantified non-zero nearest neighbor cor- 
relations, we then try to diminish the differences be- 
tween the annealed and quenched ensembles by forcing 
the annealed ensemble to present zero neighboring cor- 
relation. This constrained annealed ensemble behaves 
much more similar to the quenched ensemble than the 
annealed ensemble. Although the glass phase itself can 
not be identified using the constrained annealed ensem- 
ble which only partially corrects the overall non-random 
correlations, one can obtain thermodynamic quantities 
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which are much closer to the quenched results than the 
annealed ones using this method. 

This paper is organized as follows: In Sec. II, we briefly 
review the RNA secondary structure and introduce the 
general RNA folding problem with sequence disorder. In 
Sec. Ill, we quantify the deviation of nearest neighbor 
correlations of the annealed ensemble. Finally, we im- 
prove the pure annealed ensemble by applying a con- 
straint of random correlations in Sec. IV. 



II. RNA FOLDING PROBLEM WITH 
SEQUENCE DISORDER 

A. RNA secondary structures 

RNA is a single-stranded biopolymer of four different 
bases A, U, C, and G. The strand can fold back onto itself 
and form helices consisting of stacks of stable Watson- 
Crick pairs (A with U or G with C). This comparatively 
simple interaction scheme makes the RNA folding prob- 
lem very amenable to theoretical approaches without los- 
ing the overall flavor of the general biopolymer folding 
problem [jj. 

An RNA secondary structure S is characterized by its 
set of Watson- Crick base pairs where i and j denote 
the i th and j th base of the RNA polymer respectively 
(conventionally % < j). Here, we follow many prev ious 
studies |llllilili|lS|ll|ll|ll|ll|ll EE 113 and 
apply the reasonable approximation to exclude so-called 
pseudoknots E3> i- e -> f° r two Watson-Crick pairs £ 
S and (k,l) € S, configurations with i < k < j < I 
are not allowed. This approximation is justified, because 
short pseudoknots do not contribute much to the overall 
energy and long pseudoknots are kinetically difficult to 
form. 



B. Quenched averaging 

The properties of RNA folding, especially the possibil- 
ity of a glass phase driven by the sequence disorder, have 
been a challenging problem from the statistical physics 
points of view. To understand the statistics of this dis- 
ordered system, one first has to assign an energy E(x, S) 
to every secondary structure S for a given sequence X- 
This could, e.g., simply be the negative of the total num- 
ber of Watson-Crick base pairs. This then allows us to 
calculate the partition function 

^(x) = £n(x,s) e -^ s >/ r (i) 

s 

for a given sequence \ where n(x, S) is one when the 
secondary structure S is compatible with the sequence 
X and zero otherwise. Finally, one has to calculate the 
quenched average 

F q = -k B T{\nZ( X )) x (2) 



over all sequences x- 



C. Annealed averaging 

Unfortunately, the quenched free energy F q is very dif- 
ficult to calculate. Thus, one can try to approximate the 
quenched free energy by the much easier computed an- 
nealed free energy, which treats the disordered sequences 
as dynamic variables. This annealed free energy is only 
a lower bound of the quenched free energy, 

F a = -k B T\n{Z{ X )) x <F q . (3) 

It can be quite different from the quenched free en- 
ergy since the annealed ensemble favors those sequences 
where more binding pairs are allowed. More importantly, 
physical quantities derived from this annealed free energy 
can be very different from their quenched counterparts as 
we will show explicitly in the following sections. To be 
specific, we will measure the correlation between neigh- 
boring bases which are known to vanish in the quenched 
case. 



D. Energy models 

In this paper, we study the simplest model of disor- 
dered RNA sequences which contain only the two bases A 
and U. In assigning free energies to secondary structures, 
we neglect any loop entropies and focus on the base pairs 
alone. Besides, for most parts of this manuscript, we do 
not consider the minimal hairpin length constraint which 
requires the two bases of a binding pair to be separated 
by at least three bases in a real RNA molecule. Within 
these approximations we do consider two different energy 
models. 

In the binding energy model, we simply assign an en- 
ergy e = — 1 to each AU (or UA) binding pair. We denote 
the corresponding Boltzmann factor by q = e 1 / 7 '. This 
model captures the main features of the energetics and 
is simple enough for analytical and numerical studies. 

We also study a somewhat more realistic energy model, 
namely the stacking energy model. In this model, we as- 
sign energies to the stacking of two base pairs rather than 
to individual base pairs. This stacking energy depends 
in reality on the identities of all four bases involved. We 
implement this effect by associating a Boltzmann factor 
S\ with stackings of types uu and AA while associating a 
different Boltzmann factor S2 with stackings of types ^ 
and . To be specific, we will choose these Boltzmann 
factors as s\ — e 2 / T and = e x l T for the remainder of 
this communication. 

The main reason to study the stacking energy model 
is that the simple binding energy model is known to be 
pathological without a glass phase at low temperature 
in the disordered sequence ensemble 0, H, • A simple 
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reason is that whatever the sequence, each base A can al- 
ways find another base U to pair with provided we have 
the same amount of bases A and U. Thus, sequences dis- 
order does not cause frustration. In contrast, the energy 
distribution of the stacking energy model is greatly af- 
fected by sequences, and a structure in which all base 
pairs are stacked can in general not be found for every 
sequence. Thus, sequence disorder is expected to cause 
frustration, and a glass phase is expected in this energy 
model for low enough temperature. 

III. NEAREST NEIGHBOR CORRELATIONS 
OF THE ANNEALED ENSEMBLE 

In this section, we calculate quantitatively how the 
nearest neighbor correlations in the annealed ensemble 
deviate from their true values in the random sequence 
ensemble. To this end, we have to calculate the annealed 
partition function for sequences with length N-l, which 
is defined as 



Z a {N) = 




A A U 

U U A 

FIG. 1: Recursive relation exploring all possible secondary 
structures for a homogeneous sequence of length N. The wavy 
lines stands for contribution from all possible structures and 
sequences. The straight line stands for non-paired bases. 

For the binding energy model, this annealed partition 
function can be easily obtained via the recursive rela- 
tion shown in Fig . ^ along the lines of previous stud- 
ies [lfj, [19J, l20l l2ll |22( but taking the sequences into ac- 
count explicitly. The idea is to separate the two cases 
for the last base, which is either unbound or bound to a 
certain base k, and then relate the partition function to 
the shorter length one as 

Z a (N + l;q) = Z a {N;q) (5) 

N-l 

+ \ E Z a (k;q)Z a (N-k;q). 
k=l 

With this relation, one can obtain an analytical formula 
for the annealed partition function in the large N limit 
by performing the z-transform, which is defined as 

00 

T a {z;q) = Y t Z a {N;q)z- N , (6) 

iV=l 

on the recursive relation. After solving the resulting 
quadratic equation for Z a (z; q), we can obtain the parti- 



tion function by doing the inverse z-transform, 

Z a (N; q) = ± I £(z; q)z N ~ x dz. (7) 

This approach can be easily generalized to the stacking 
energy model. 

In order to keep track of the correlations by the an- 
nealed ensemble, we assign an additional Boltzmann fac- 
tor L to all AA and UU neighbors within the sequence. 
The modified annealed partition function is then 

Z a {N- q, L) = ^ J2 (j2 n (*' S)g n * (S) L"^ , 

(8) 

where n q (S) is the number of binding pairs in a secondary 
structure S, and til{x) is the number of conjugate neigh- 
bors, i.e., AA and UU neighbors in the sequence. 

The additional Boltzmann factor complicates the cal- 
culation of the partition function since different bases A 
and U contribute differently. However, we can still formu- 
late recursive relations by noticing that the two end bases 
of a sequence piece determine the correlations with other 
pieces. Thus, we can separate a sequence into two cases 
where the end bases are either of the same type or not, 
and formulate the recursive relation for each case inde- 
pendently. The annealed partition function Z a {N;q,L) 
is then obtained via z-transform as before. Since the for- 
mation of the recursive relations is quite technical, we 
only address the result here, and defer the details to Ap- 
pendix A. 

From the partition function we can obtain the near- 
est neighbor correlations by looking at the deviation of 
the averaged fraction of AU (or UA) neighbors from the 
expected value 1/2 in the disordered sequence ensemble. 
This deviation S is obtained by taking the derivative as 

5= l --^Ld L \n{Z a {N-q,L))\ L=1 . (9) 

A. Binding energy model 

Fig. [3 shows the neighbor correlations for the binding 
energy model. We find that the deviation moves further 
away from zero as temperature decreases. This is a di- 
rect result from the fact that at low temperature, the 
main contributions to the annealed partition function 
come from those sequences which allow a lot of bind- 
ing pairs, unlike the quenched case where sequences are 
equally weighted. 

The exact way that the neighbor correlations are bi- 
ased can be understood as follows. In this binding energy 
model, the only thing that biases the nearest neighbor 
correlations is the formation of minimal hairpins since 
they enforce the neighboring bases to be different, which 
are either AU or UA. Thus, the degree of bias is directly 
coupled to the fraction of smallest hairpins in a sequence. 
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FIG. 2: Deviation of the fraction of AU (or UA) nearest neigh- 
bors. The deviation is plotted as a function of temperature 
in units of the binding energy for the binding energy model. 
Notice that the deviation moves further away from zero and 
stops at a fixed constant as temperature decreases. It also 
approaches a limit larger than zero at high temperature indi- 
cated by the dashed line. 



B. Stacking energy model 

Following the same approach, we check the same devi- 
ation as a function of temperature in the more realistic 
stacking energy model. Again, only the result is quoted 
here in Fig. [31 (interested readers can check the detailed 
calculations in Appendix C). 




This assertion can be verified by studying the fraction 
of minimal hairpins. As an example, we study the zero 
temperature case where all the bases are expected to be 
paired. Among all possible pairing structures, we explic- 
itly calculate the fraction of smallest hairpins (with the 
details shown in Appendix B). As a result, every fourth 
base is part of a minimal size hairpin. Thus, we have 1/4 
AU (or UA) nearest neighbors from these hairpins and 
another 1/2x3/4 = 3/8 from the rest of the bases since 
they do not show nearest neighbor correlation bias. The 
deviation of the fraction of AU (or UA) neighbors is then 
expected to be 5/8— 1/2 = 1/8, which matches exactly 
the zero temperature limit in Fig. [21 In this case, the 
sequence, as a dynamic variable, adjusts itself to all the 
binding pairs. 

Even in the high temperature limit, although all al- 
lowed sequences are equally weighted, there still exists a 
finite fraction of minimal size hairpins on average. As a 
result, the deviation of neighbor correlations approaches 
a constant larger than zero. 

The assertion that the deviation 6 is coupled to the 
formation of minimal size hairpins is again verified as we 
additionally require all the hairpins being of length larger 
than one. In this case, the correlation between nearest 
neighbors becomes random at all temperatures. How- 
ever, the second nearest neighbor correlations become 
non-trivial. 

This simple binding energy model gives us a taste how 
the nearest neighbor correlations are coupled with the en- 
ergy through the structure, i.e., the formation of minimal 
hairpins. This correlation is biased since the annealed en- 
semble puts more weight on lower energy sequences. 



FIG. 3: Deviation of the fraction of AU (or UA) nearest neigh- 
bors for the energy model involving stacking energies. Unlike 
in the case of the binding energy model, the AU (or UA) 
neighbor correlations are completely biased at zero tempera- 
ture in the stacking energy model. At high temperature, this 
deviation approaches the same limit as the binding energy 
model. 

Unlike the binding energy model, at zero temperature, 
the nearest neighbor correlations of the stacking energy 
model are completely biased. Almost no AU (or UA) 
neighbors can be found in this annealed system. This 
can be understood since at zero temperature, the only 
dominating structure is a long stem in which all stack- 
ing loops are of type si. Thus, the only two important 
sequences are the ones made of half consecutive A bases 
and the other half of U bases. 

To verify this structure, we additionally introduce an- 
other Boltzmann factor h for each hairpin loop formation. 
With this Boltzmann factor we can keep track of the frac- 
tion of hairpins fh in the annealed system by calculating 



fh = jjhd h HZ a {N- Sl , s 2 ,h, L = l))U=i. (10) 

From Fig. 01 we do see that the fraction of hairpins of this 
annealed system indeed goes to zero as temperature goes 
to zero, which is a feature of the long stem structure. 

At high temperature, however, the energy model does 
not matter since entropy dominates. Thus, the AU (or 
UA) fraction approaches the same limit as in the binding 
energy model. 

From this stacking energy model, we learn that the 
stronger the energy is coupled to the nearest neighbor 
correlations, the larger deviation in nearest neighbor cor- 
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FIG. 4: Fraction of hairpins in the stacking energy model for 
three different ensembles. 



relations of the annealed system will be present at low 
temperature. 



IV. CONSTRAINED ANNEALING 

So far we have only observed the sequence correla- 
tions artificially introduced through the annealed ensem- 
ble. However, our approach can in fact be used to gener- 
ate more realistic ensembles within the annealed frame- 
work. The idea is to force the nearest neighbor correla- 
tions to be random when performing the annealed aver- 

age mm. 

We simply enforce this random disorder constraint, i.e., 
the fraction of AU (or UA) neighbors being one half by 
setting the Boltzmann factor L, which controls the near- 
est neighbor correlations, to whatever value it needs to 
have for the correlations of the annealed ensemble to van- 
ish. 

This constrained annealing turns out to predict ther- 
modynamic quantities much closer to the quenched re- 
sults. And it can be done immediately following our 
quantified deviations in disorder. 



A. Binding energy model 



The constraint for the binding energy model is read as 



1 

N 



L&l ln(Z a (N; q, L))\ 



L=L C 



fill 



In this energy model, we expect the sequences with more 
AU (or UA) nearest neighbors to be suppressed since the 
annealed system favors those neighbors. As a result, L c , 
which favors A A (or UU) neighbors, is expected to be 
larger than one in order to meet the constraint. Further- 
more, L c should be larger at lower temperatures since 
the neighbor correlation is more biased at lower temper- 
atures. 



One important note is that the resulting free energy 
is only defined up to an additive constant, i.e., adding 
a constant background potential does not change the re- 
sult at all. Thus, the absolute value of this constrained 
annealed free energy as well as the Boltzmann factor L c 
has no real meaning. For example, one could assign the 
Boltzmann factor L to AU (or UA) neighbors instead of 
AA (or UU) neighbors. The resulting chemical potential 
would then change a sign and the free energy would dif- 
fer by a constant amount. However, the thermodynamic 
quantities, which are calculated by taking derivatives of 
the constrained free energy, will not see this constant and 
are expected to be closer to the quenched result. 

To verify this assertion, we are going to compute the 
average fraction of binding pairs for the binding energy 
model via q/Nd q \n(Z a (N;q,L)) as a function of tem- 
perature. Then, we compare the cases of the annealed 
(L = 1), the constrained annealed (L = L c ) and the 
quenched ensembles. 

As to the quenched result, we numerically calculate the 
partition function given random sequences of length 1280 
and collect the data from 1000 random sequences. In or- 
der to avoid the trivial finite size effects due to fluctuation 
of the fraction of A bases away from its expected value 
1/2, we only choose sequences that contain exactly 640 
A's and 640 U's. The result is shown in Fig. EJ The sta- 
tistical errors of the quenched results are always smaller 
than the size of the corresponding symbol, such that 
within the error bars the quenched results never overlap 
other curves. This condition holds for all other quenched 
results in this manuscript. 
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FIG. 5: Fraction of binding pairs in the binding energy 
model. The constraint of random nearest neighbors brings 
the annealed quantity closer to the quenched numerical esti- 
mate. The statistical errors of the quenched results are always 
smaller than the size of the symbol. 

We see that the constrained annealed result is indeed 
very close to the quenched numerical estimate. However, 
all three results are rather close to each other anyway. 
The reason for these three cases being so close to each 
other is simply that under this energy model the system 
is not glassy, and every base is able to find another base 
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for pairing in this binding energy model. Thus, at zero 
temperature, all the bases are paired in all three systems. 
The fact that the nearest neighbor correlations are not bi- 
ased a lot can also be verified as we find that at T=0.1, L c 
to be just 1.59. Thus, the chemical potential introduced 
from the constraint is comparatively small and does not 
affect the result too much. 



B. stacking energy model 

The situation for the stacking energy model is very 
different from that of the binding energy model. Here, 
we follow the same approach and compute the averaged 



fraction of stacking loops of type ^ (or V ^ A ) and ^ (or F jq - 



uu ■ 



AU 



UA> 



as a function of temperature under the constraint, 
^Ld L \n(Z a (N; Sl , s 2 ,h = 1, L))\ L=Lc = i. (12) 



Similarly, in order to avoid the trivial finite size effects 
for the quenched numerical estimate, we fix the number 
of AA, AU, UA, UU nei ghbo rs in the randomly chosen 
sequences to be 320 each |25j . 




FIG. 6: Fraction of stacking loops ^ (or ^) in the stack- 
ing energy model. The constraint of random nearest neigh- 
bors fixes this quantity much better than averaged number 
of pairs in the binding energy model. The phenomenolog- 
ical constraint, i.e., a fixed fraction of hairpins, brings this 
quantity only a bit closer to the quenched result. 

From Figs. El and we see that the constrained an- 
nealed results are greatly improved over the plain an- 
nealed results. This verifies the idea that larger devia- 
tions from the random disorder result in a better cor- 
rection via the constraint of the random disorder. For 
this stacking energy model, at T=0.1, -L c =0.0067 is much 
more different from 1 than in the binding energy model. 

From these results, we can see that the constrained 
annealed ensemble of the stacking energy model behaves 
in the following way. Since the ensemble is forced to 
have the same number of AA (or UU) and AU (or UA) 
neighbors, at zero temperature, the dominating structure 
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Fraction of stacking loops ^ (or ^ ) 
ing energy model. Again, the constraint of random nearest 
neighbors greatly improves the result. However, unlike the 
case in Fig. |SJ the constraint of a fixed fraction of hairpins 
also contributes in bringing the annealed quantity closer to 
the quenched result. 



is still a long stem structure, but with half the stacking 
loops being of type si and the other half being S2- This 
is consistent with the fact that fraction of hairpins going 
to zero as temperature goes to zero for the constrained 
annealed system as shown in Fig. 0] . 

One difference between the quenched ensemble and the 
constrained annealed ensemble is that not all the bases 
of a random sequence can form stacking loops. Thus, we 
have a finite fraction of hairpins in the quenched ensem- 
ble (Fig. . This difference can used as an additional 
phenomenological constraint to improve the constrained 
annealed system even further. 

We apply this additional phenomenological constraint 
by requiring the fraction of hairpins fh to fit the quenched 
numerical estimates and neighboring bases to be uncor- 
rected at the same time, i.e., to enforce 



N 



d L ln(^ a (iV; Si,s 2 , h, L))\h=h c ,i 



(13) 



d h ln(Z a (N;si,S2,h,L))\ h=hc , L=Lc = fh(T), (14) 



where fh(T) is the quenched numerical estimate in this 
equation. 

From Figs. and we see that this additional con- 
straint slightly improve the fraction of stacking loops s±, 
but significantly improves the fraction of stacking loops 
S2- This can be understood since the existence of hair- 
pins introduces AU (or UA) neighbors, if the fraction of 
AU (or UA) neighbors is also required to be one half, it 
will decrease the fraction of stacking loops S2 among the 
stem structures. 



V. CONCLUSION 

We conclude that the deviation of the annealed en- 
semble from the quenched ensemble is strongly related 



to the energy model and can be completely biased when 
the correlation is strongly coupled to the energy of the 
system. Quantifying this deviation allows us to do con- 
strained annealing which brings the predictions of ther- 
modynamic quantities much closer to the real values in 
the quenched ensemble. As the deviation is larger, the 
constraint is stronger and thus brings the annealed en- 
semble even closer to the quenched results. Unfortu- 
nately, the biasing toward the quenched ensemble is not 
strong enough to actually drive the system into the glass 
transition. 

Besides the nearest neighbor correlations, one could 
also consider the correlations for next nearest neighbors 
or even two bases separated by arbitrary distances. In 
principle, all these correlations together would bring us 
to the exact quenched results and thus to the glass transi- 
tion. However, the calculations become much more cum- 
bersome as one increases the distance between the two 
bases, and are left for future work. 
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FIG. 8: Recursive relation for the annealed partition func- 
tion over heterogeneous sequences where the first and the last 
bases form a conjugate pair. A letter 'E' is used to denote 
that the two bases at the ends of the arch are conjugate bases. 




1 N-IN 1 N-IN 1 N-IN 

A UA UUA AU 



FIG. 9: Decomposition of an arch with its last base inside 
being a free base, which can be either A or U, into two cases. 
The letter 'O' is used to denote that the two bases at the ends 
of the arch are non-conjugate. 



VI. APPENDIX 

A. Annealed partition function for the binding 
energy model 

The annealed partition function is obtained by first 
summing over all compatible sequences given a secondary 
structure S and then summing over all possible structures 
S, which can be done via the recursive relation in Fig. ^ 
We define the annealed partition function for a sequence 
of length N as Z a (N +1). In addition, the annealed 
partition function for a sequence of length N with its two 
end bases paired is defined as A e (N — 1). The recursive 
relation in Fig. ^ is then read as 

N— 1 

Z a (N+l) = l -±L Za (N)+Y, ^-^Z a (k-l)A e (N-k). 

k=l 

(15) 

The factor (1 + L)/2 for the first term on the right hand 
side comes from the contribution in nearest neighbor cor- 
relations between the free base N and base N-l, and the 
2 takes care of averaging over the number of sequences. 
We have a similar factor in the second term coming from 
the correlation between base k of the arch and base k-1. 
In the later part we will show that the behavior of the 
annealed partition function is mainly determined by the 
arch term A e , so we will only look at this quantity here. 
The first base of A e is also specified to be A and the last 
base to be the conjugate base U. 

Again, the annealed partition function for the arch can 
be obtained through a similar recursive relation (Fig. |HJ . 
The two terms on the right hand side are further decom- 
posed in Figs. 151 and HDI 

In these relations, we need to keep track of two factors: 
the energy contributions and the nearest neighbor corre- 
lations. From the energetic point of view, an arch can 
be thought of simply contributing a Boltzmann factor q 



and need not stand for a real binding pair, even though 
initially it is used to represent a real binding pair. Thus, 
in Fig. |5J as we try to relate the annealed partition func- 
tion to its shorter length ones, we assume an effective 
binding pair between bases 1 and N-l simply to conserve 
the energy contribution. In this case, the two bases are 
not really paired. 

In order to keep track of the correct nearest neigh- 
bor correlations, we use a letter E on an arch to denote 
that the two bases at the ends of the arch are conju- 
gate bases. Similarly, a letter O is used to represent two 
non-conjugate bases at the ends of the arch. Thus, in 
Fig- El the t w0 cases where base N-l is either A or U are 
separated and are denoted by letter E and O, which is 
determined by whether the bases 1 and N-l are conjugate 
or not. These notations enable us to connect the decom- 
posed terms recursively back to the relation in Fig. [H] 

In Fig. ^3 an inner arch can be treated as a free base in 
considering the energy and correlations for the rest of the 
bases outside the inner arch. However, there is a differ- 
ence in counting neighbor correlations for this treatment 
because the free base looks as a base A from the right, 
but as a base U from the left. The correct correlations 
can be obtained if we shift this discrepancy to the last 
base and flip it from U to A. Thus, the last term carries 
a letter O on the arch instead of E. 

These recursive relations are then read as 

A e (N-l) = ^A e (N-2) + ^A a (N-2) + (16) 

N-2 

- MN -k-1) [LA Q (k - 1) + A e (k - 1)] , 

k=2 

A (N-1) = ^Ao(N-2) + ±A e (N-2)+ (17) 

N-2 

i Y, A e (N -k-1) [LA e (k - 1) + A a (k - 1)] . 

fe=2 
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FIG. 10: Separation of arches. 



Together with the initial conditions, 4(1) = q, 
4,(2) = qL, 4(1) = qL, 4(2) = q(l + L)/2, one can 
solve for A e (N) by performing the z-transform 



4(z) = J2m n )< 

jV=l 
oo 

A (z) = J2 A °( N > 



-N 



-N 



(18) 
(19) 



N=l 



on the recursive relations. After solving for A e (z), A e (N) 
can be obtained through the inverse transform 



A e (N) 



1 

2?ri 



A e (z)z N ^dz. 



(20) 



From previous studies ^} 1 we know that in the thermo- 
dynamic limit, the partition function has an analytical 
form as A e {N) oc N~ 3 / 2 z c (q,L) N , where z c (q,L) is the 
greatest real part among the branch points obtained from 
the solution of A e (z). 

Similarly, if we perform z-transform on equation 15, 
we can relate the z-transform of the annealed partition 
function Z a (z) to that of the arch A e (z). Since these two 
share the same branch points, the asymptotic behavior 
of the annealed partition function is different from the 
above formula for the arch by just a different prefactor, 
which does not play a role in the thermodynamic limit. 

The fraction of AA (or UU) neighboring bases per 
base of the annealed system is then easily calculated as 
L8l \n(z c (q, L))\l=i- Unfortunately, the analytical solu- 
tion of this set of polynomial equations is too cumber- 
some to convey any useful information. Thus, we resort 
to numerical evaluation of this analytical solution in this 
paper. 



B. Fraction of minimal hairpins at zero 
temperature 

As discussed in the main text, the fraction of minimal 
size hairpins can be easily obtained once we figure out the 
partition function. At zero temperature, the partition 
function is simpler than the finite temperature one since 
we only need to consider the ground states where all bases 
are paired. This partition function is obtained through 
the recursive relation in a similar way as shown in Fig. 1111 

We define the partition function for a sequence of 
length 2(N-1) as Z m (N,h), where h is the Boltzmann 



2N 



k=l 1 2k-l 



2IM 1 



2N-1 2N 



FIG. 11: Recursive relation for the partition function where 
all the bases are paired. 



factor for a minimal size hairpin. The recursive relation 
is then read as 

N-l 

Z m {N + l) = Z m (k)Z m (N - k + l)+hZ m {N). (21) 

k=l 

Together with the initial conditions, Z m (l) = 1 and 
Z m (2) = h, one can obtain the asymptotic behavior 
through z-transform. After simple algebra, we have the 
largest pole z c (h) = h + 2\fh + 1 for the z-transform 
of partition function Z m (z,h). The partition function 
Z m (N) is then proportional to z c (h) N . 

The fraction of minimal size hairpins per two bases is 
then easily calculated as 

d h lnz c (h)\ h=1 = l + l l f f i \ h=1 = l/2. (22) 
h + 2Vh + 1 

Thus, the fraction of minimal size hairpins per base is 
1/4. 



C. Annealed partition function for the stacking 
energy model 

The calculation for the stacking energy model follows 
the same approache. However, it is a bit more compli- 
cated since we need to keep track of stacking loops in- 
volving four bases which leads us to the recursive relation 
depicted in Fig. El 





FIG. 12: Recursive relation for the stacking energy model. 

In these recursive relations, we use an additional letter 
S on the arch to denote the fact that we consider the 
stacking energy of the stacking loop formed partly by 
that binding pair. Independent of the type of the arch, all 
the stacking energies inside the arches are still considered 
in all cases. Thus, the first term on the right hand side 
in Fig. E] does not contain an S because its base N-l is 
unbound, and no stacking loop can be formed with the 
binding pair of the arch. 

Similar to the recursive relation in previous section, 
we then discard the last base as a free base as shown 
in Fig. El Again, the arches on the right hand side 
are meant to preserve the energy contributions only. In 
the second line of the relation, we further decompose the 
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- £1 + 

N-1N 1 N-1N 1 N-1N 

U A U U A A U 

1 N-l 1 2 N-2N-ll I N-l 1 2 N-2N-1 

I A U A A U U / \ A AAA A / 

\ U A ' « U A ' 



FIG. 13: Decomposition of the annealed partition function 
which last base inside the arch is a free base. 



terms in order to relate these terms with the first recur- 
sive relation in Fig. 



1 k N-1N k N-l 
A U A U 




k+1 

A 



FIG. 14: 
bution. 



Separation of arches considering the hairpin contri- 



In Fig. we also separate the contributions of the 
inner arch from the rest part as in Fig. One differ- 
ence is that we consider the contribution from the hairpin 
loops in this case. Thus, the hairpin loop contained in 
the outer arch term is not a real hairpin loop because 
of the existence of the inner arch. The correct result is 
obtained by adding the last term in the relation. 

In this stacking energy model, we denote the annealed 
partition function for an arch of length N-l as A es (N). 
The recursive relations are then read as follows 



A es {N-l) = | ^ es (Af-2)-(si-l) (1+ 4 L2) A es (iV-4)j + i (A os (N-2)-(s2-l)^A es (N-4,)) + SlI? + 82 A es (N-3) (23) 



A os (N-l) 



N—J. 

-J2 A es (N-k-l) 



L ( A„ s (k-l)-(si-l)^A es (k-3)J + (A ee (k-l)-(s 2 -l)- + 



, -A ea (k-3) +2(--l)ff (fc) 
4 J h 



| (A os (N-2)~(si-l)^A es (N-4)\ + i ^A. s {N-2)~{s2-l)^-^—A ea {N~^ + SlL ± S2L A es (N~3) (24) 

JV-2 

r 



L fA es (k-l)-( Sl -l)^^-A es (k-3)) + (A os (k-l)-(s2-l)^A es (k-3) \ + 2(i -l)H e (k) 



where the terms H e and H stand for the contribution 
from a hairpin. They are obtained separately from a 
recursive relation similar to the one in Fig. [51 by just 
replacing the wavy line by a straight line, which means 
that bases are not bound. One can then easily formulate 
the recursive relations for H e and H Q . 



Together with the initial conditions: A es {\) = h, 
A es (2) = hL, A es (3) = h{\ + 3L 2 + s 2 + s^ 2 )/^ 
A os (l) = hL, A os {2) = h{l + L 2 )/2, A os {3) = h(3L + 
L 3 + s%L + S2^)/4, we can perform z-transform to ob- 
tain the asymptotic behavior of the annealed partition 
function for the stacking energy model. 
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